Part I: R Introduction

What is R?

R is a programming language used for statistical analysis and graphics. It is an open-source implementation of S, a programming language originally developed at AT&T's Bell Laboratories; S-PLUS is a commercial implementation of the same language.

Why R?

  • Open source, cross-platform, and free
  • Great for reproducibility
  • Interdisciplinary and extensible
  • Tons of learning resources
  • Works on data of all shapes and sizes
  • Produces high-quality graphics
  • Large and welcoming community

R: Object-Oriented Programming

Unlike many other statistical packages, such as SAS and SPSS, R will not spit out a mountain of output on the screen.

Instead, R returns an object containing all the results. You, as a user, have the flexibility to choose which results to extract or report.
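
A minimal sketch of this idea, using R's built-in cars data set and the lm() function. The object names fit, coefs and r2 are illustrative choices only. The fitted model is stored as an object, nothing is printed until we ask, and we extract only the pieces we want.

```r
# Fit a regression and store ALL of the results in one object
fit <- lm(dist ~ speed, data = cars)

# Nothing has been printed so far; we choose what to extract
names(fit)                      # the components stored inside the object
coefs <- coef(fit)              # just the estimated coefficients
r2 <- summary(fit)$r.squared    # just the R-squared

coefs
r2
```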

R: Functional Programming

This feature allows us to write faster yet more compact code. For example, a common theme in R programming is the avoidance of explicit iteration. Unlike in many other statistical packages, explicit loops are discouraged.

Instead, R provides functions that allow us to express iterative behavior implicitly.
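
As a small sketch of what this means in practice, the three approaches below compute the same square roots. The variable names are made up for this illustration; sqrt() and sapply() are base R functions.

```r
vals <- c(1, 4, 9, 16)

# 1. Explicit loop: works, but verbose
roots_loop <- numeric(length(vals))
for (i in seq_along(vals)) {
  roots_loop[i] <- sqrt(vals[i])
}

# 2. Vectorized call: sqrt() is applied to every element at once
roots_vec <- sqrt(vals)

# 3. sapply(): applies a function to each element without a hand-written loop
roots_apply <- sapply(vals, sqrt)

roots_vec                           # 1 2 3 4
identical(roots_loop, roots_vec)    # TRUE
identical(roots_vec, roots_apply)   # TRUE
```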

R: Polymorphic

R is also polymorphic, which means that a single function can be applied to different types of inputs (much more user-friendly).

Such a function is called a generic function (if you are a C++ programmer, you have seen a similar concept in virtual functions).

Polymorphic - Example

Let's look at one example, plot():

  1. Plot a vector of numbers
  2. Plot some model results

Whatever the purpose, we use the same function.

data<-c(1,2,3,4)
plot(data)

# Regression Analysis
par(mfrow=c(2,2),mar=c(2,4,2,2))
results<-lm(speed ~ dist,data=cars)
plot(results)
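
To see what happens behind a generic function, here is a small sketch of R's S3 dispatch mechanism. The function describe() and its methods are hypothetical names invented for this illustration; they are not part of R.

```r
# A generic function: it only decides which method to call,
# based on the class of its argument
describe <- function(x) UseMethod("describe")

# One method per type of input (hypothetical examples)
describe.numeric   <- function(x) paste("a numeric vector of length", length(x))
describe.character <- function(x) paste("a character vector of length", length(x))
describe.default   <- function(x) "something else"

describe(c(1, 2, 3, 4))   # "a numeric vector of length 4"
describe(c("a", "b"))     # "a character vector of length 2"
describe(TRUE)            # "something else"
```

plot() works the same way: calling plot() on a model fitted with lm() ends up running the method written for lm objects.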

Why RStudio?

  • R Interface is ugly!

  • Many students in this class are much more familiar with the Windows operating system and have never been exposed to programming before, so we will use RStudio, one of the free Graphical User Interfaces (GUIs) that have been developed for R.

    • RStudio should really be considered an integrated development environment (IDE), since it is aimed more toward programming.
  • Easy publishing of reproducible documents such as reports, interactive visualizations, presentations, and websites.

RStudio: A short tour

Initial Start

When you first (as in the very first time) open RStudio, you will see three panels.


Console

  1. Every time you launch RStudio, it will have the same text at the top of the console telling you the version of R that you’re running.
  2. Below that information is the prompt, > . As its name suggests, this prompt is really a request, a request for a command.
  3. Initially, interacting with R is all about typing commands and interpreting the output.
  4. These commands and their syntax have evolved over decades (literally) and now provide what many users feel is a fairly natural way to access data and organize, describe, and invoke statistical computations.

The console is where you type commands and have them immediately performed.


Environment

The panel in the upper right contains your workspace (aka Environment)

  1. This shows you a list of objects/variables that R has saved.
  2. For example here a value of 3 has been assigned to the object a.

History

Up here there is an additional tab to see the history of the commands that you’ve previously entered.


Files

The files tab allows you to open code/script files within RStudio.


Plots

Any plots that you generate will show up in the panel in the lower right corner.


Help

To check the syntax of any function in R, type ? in front of the function name to pull up the help file.

For example here I typed ?mean to get the help file for the mean function. The help files are not always the most useful but are usually a good place to start.


Script File

The top left is your editor window, where you write code or scripts; the console is now at the bottom. I usually change this layout.

The picture above illustrates my preferred style in RStudio.

R Script

Most R users submit commands to R by typing in either the console or the editor panel, rather than clicking a mouse in a Graphical User Interface (GUI).

In this class, we will make extensive use of scripts. A script is nothing but a collection of commands and procedures that the coder performed to get to their results and conclusions.

There are at least two advantages of doing so:

  1. As explained earlier, this allows us to run a whole collection of commands together by putting them in one file.
  2. It is also a lot more transparent and straightforward to share and replicate what you have done.

This will always be our approach in this class!!!


Exercise

Task 1: Create a script file

  1. Open RStudio and go to File > New File > R Script. This will open a blank text document.

  2. In the document, type
x = 5  # Assign the variable x a value of 5
x == 5  # Does x equal 5? Notice the double ==
  3. Highlight both lines of code and click the button marked “Run”. If everything is working correctly, the console should display TRUE.

  • OR, press the keyboard shortcut for Run, which depends on whether you’re running macOS, Linux, or Windows.

  4. Go to “File > Save As”, and choose a file name.

Part II: Working with Scripts

Comments

Whenever possible, use comments! Anything following the symbol # in an R Script will not be run in R.

Comments are notes we leave ourselves so we know:

  • exactly who wrote the code (important in companies where many people may work on a project)
  • the purpose of the code!
  • what our thought process was at a particular line of code.

I promise that this will become useful when you come back to your code after an extended time. I cannot tell you the number of times I have had a moment of pure genius while coding, only to spend hours on a different day trying to understand why I coded it like that or what I actually did.


For example, below are the kinds of comments that I always include in my programs:

# Project: 'Tutorial on R Studio II'
# Author:  Shamar Stewart
# This program illustrates some basic programming philosophy
# and R operations

You can also understand the following code without knowing exactly what each command does, because I tell you what they are!

# Set seed number so that all the results based on random samples 
# are reproducible.
  set.seed(12345)
# Then create a normally distributed random variable, x, with 500 
# observations.
  x <- rnorm(500)
# Notice "<-" is the universal assignment operator in R (I prefer this to "=")

Exercise

Task 2:

At the top of the previous script (Task 1), add and expand on the following comments:

  1. The project
  2. The author
  3. The purpose of this program

Follow the example given above.


R Basics

Arithmetic

  
  1 + 1 #add numbers
## [1] 2
  
  8 - 4 #subtract them
## [1] 4
  
  13/2 #divide
## [1] 6.5
  
  4*pi #multiply (pi is a built-in constant in R)
## [1] 12.56637
  
  2^10 #exponentiate
## [1] 1024

Logical Comparison

Logical comparisons result in a value of TRUE or FALSE.

  3 < 4
## [1] TRUE
  3 > 4
## [1] FALSE
  3 == 4
## [1] FALSE
  3 != 4
## [1] TRUE
  10 - 6 == 4
## [1] TRUE
  
  # Notice the difference between single and double equal signs

Now try 3 = 4. What is the result here?

Strings (text)

#R delimits strings with EITHER double or single quotes.
#There is only a very minimal difference
  
message1 <- 'Let us get to coding!'  
message2 <-  "Please get to coding!"  
print(message1)
## [1] "Let us get to coding!"
print(message2)
## [1] "Please get to coding!"

We can also print the result(s) stored in our variables by simply running the variable name instead of print().

message1
## [1] "Let us get to coding!"

Variables

  • Variables are used to store values and results. Assignment to a variable happens from right to left: the value on the right side gets assigned to the name on the left side. You can use nearly anything as a variable name in R. The only rules are:
  1. “.” and "_" may be used in variable names, but no other symbols.

  2. Your variable name must not start with a number or _ (2squared and _one are illegal).

  • A note for those of you who have programming experience: while R supports object-oriented programming, periods “.” do not have a special meaning in the language. For historical reasons, R programmers often use periods in place of underscores in variable names, but either works. Just be consistent to keep your code readable.

  • R is case sensitive. Capitalization of variable names matters.

    x <- 42
    x / 2
# [1] 21

    # redefine x
    x <- x + 3
    x
# [1] 45
    
    #if we assign something else to x, the old value is deleted        
    x <- "Hokies!"
    x
# [1] "Hokies!"
    
    foo <- 3 
    bar <- 5
    foo.bar <- foo + bar
    foo.bar
# [1] 8
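
A short sketch of the naming rules and of case sensitivity. All variable names here are made up for illustration.

```r
score <- 10
Score <- 20            # a different variable: R is case sensitive
score == Score         # FALSE

exam.score <- 90       # a period in a name is allowed
exam_score <- 90       # so is an underscore
exam.score == exam_score   # TRUE: two separate variables holding equal values
```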

Exercise

Task 3:

  1. Create a variable called entry that stores the year you started at Virginia Tech.
  2. Store the current year to a variable called current_t.
  3. Compute the difference between current_t and entry. Store this as diffs.
  4. Store your birth year as my_year. Now compute the difference between current_t and my_year. Assign the results to my_diffs.
  5. Use this information to compute the percentage of your life you have spent at this university. Be sure to use brackets if you need them.
  6. Assign this result to a variable of your choosing.

Clearing the memory

To remove all variables in memory:

#    ls() # List of all variables in memory
    rm(list=ls())
  • I usually place this at the beginning of my R script (just after the document details).

Part III: R Data types

Data types in R

You have observed a few of the different data types in the earlier sections. Here, we will formally discuss them. Some of the most basic data types we will cover are:

  1. Decimal values like 4.5 are called numerics.
  2. Whole numbers like 4L are called integers (the L suffix tells R to store the value as an integer; a plain 4 is stored as a numeric). Integers are also numerics.
  3. Boolean values (TRUE or FALSE) are also called logical.
  4. Text (or string) values are called characters.

You can check the type of data by using class().

  x <- "Lyrics to Virginia Tech Fight Song!"
  class(x)
## [1] "character"
  x2 <- c("TRUE", "FALSE")
  x2 <- as.logical(x2) #Declare the data type
  x2
## [1]  TRUE FALSE
  class(x2)
## [1] "logical"
  x <- 1:20
  x %% 4 #x mod 4
##  [1] 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0
  x %% 4 == 0
##  [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE
## [13] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE
  class(x %% 4 == 0)
## [1] "logical"
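
A quick sketch that checks each of the basic types listed above with class(). The values are arbitrary examples.

```r
class(4.5)       # "numeric"
class(4)         # "numeric": a plain 4 is stored as a decimal number
class(4L)        # "integer": the L suffix asks for an integer
class(TRUE)      # "logical"
class("4")       # "character": the quotes make it text
is.numeric(4L)   # TRUE: integers also count as numeric
```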

Vectors

A vector is the most common and basic data type in R, and is pretty much the workhorse of R. A vector is characterized by a series of values, which can be either numbers or characters. We can assign a series of values to a vector using the c() function.

In short, vectors are most useful when we have a collection of data points.

Here c() stands for concatenate or combine.

(v <- c(1, 2, 3, 4))
## [1] 1 2 3 4
(v <- 1:4)
## [1] 1 2 3 4
(v <- seq(from = 0, to = 0.5, by = 0.1))
## [1] 0.0 0.1 0.2 0.3 0.4 0.5
#A vector can also contain characters:  
(v_colors <- c("blue", "yellow", "light green") )
## [1] "blue"        "yellow"      "light green"

Notice that by encasing the beginning and end of the assignment lines in parentheses, we immediately print the stored values.

Subsetting vectors (Indexing/reassigning elements)

We are able to index (collect subsets of our variables) by using square brackets. Unlike Python, for example, R’s indexing begins from 1.

v_colors[2] # We are trying to extract the second element of the vector, v_colors        
## [1] "yellow"
v_colors[c(1,3)]  # We can use the concatenation function to get nonconsecutive elements. Here, we are trying to extract elements in positions 1 and 3.
## [1] "blue"        "light green"

How would you extract elements 1:9, 15, 19, 20 and 21:30 in zz below?

set.seed(1234)
zz <- rnorm(100)

Answer:

zz[c(1:9, 15, 19, 20:30)]
##  [1] -1.20706575  0.27742924  1.08444118 -2.34569770  0.42912469  0.50605589
##  [7] -0.57473996 -0.54663186 -0.56445200  0.95949406 -0.83717168  2.41583518
## [13]  0.13408822 -0.49068590 -0.44054787  0.45958944 -0.69372025 -1.44820491
## [19]  0.57475572 -1.02365572 -0.01513830 -0.93594860

We can replace elements in specific positions. Below, we replace the second and third colors with red and purple.

(v_colors[2:3]  <- c("red", "purple")   )
## [1] "red"    "purple"

Sometimes it might be more convenient to get rid of particular elements instead. For example, I might want to extract all but the first 5 elements of a vector, or all but the 15th element. We might find it easier to use a negative index here.

j <- c(-1,-2,-3)
x[j]
##  [1]  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

# We could have done that in one go as well
x[-c(1:3)]
##  [1]  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
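
The two cases mentioned above (all but the first 5 elements, and all but the 15th element) can be sketched as follows. The vector w is a made-up example so that the vectors already defined are left untouched.

```r
w <- 101:120           # an example vector with 20 elements

w[-(1:5)]              # all but the first 5 elements: 106, 107, ..., 120
w[-15]                 # all but the 15th element (115 is dropped)

length(w[-(1:5)])      # 15
length(w[-15])         # 19
```

Note the parentheses in -(1:5): without them, -1:5 would mean the sequence from -1 to 5, which mixes negative and positive indices and is an error.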

Conditional subsetting

Another common way to subset is by using a logical vector. TRUE will select the element with the same index, while FALSE will not. Typically, these logical vectors are not typed by hand, but are the output of other functions or logical tests such as:

x <- 100:110
x
##  [1] 100 101 102 103 104 105 106 107 108 109 110
x >105 # returns TRUE or FALSE for each element, depending on whether it meets the condition
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
select <- x > 105
x[select]
## [1] 106 107 108 109 110

If we would like the elements that evaluate to FALSE instead, we could easily use the ! (NOT) operator

x[!select]
## [1] 100 101 102 103 104 105

You can combine multiple tests using:

  • & (AND operator - both conditions are true) or

  • | (OR operator - at least one of the conditions is true)

We can select the elements of x between 103 and 106 (inclusive):

x[x >= 103 & x <= 106]
## [1] 103 104 105 106

x is greater than 103 but (AND) less than or equal to 106

x[x <= 106 & x > 103] # order of subsetting does not matter here!
## [1] 104 105 106

x is less than 103 or greater than or equal to 106

x[x >= 106 | x < 103] 
## [1] 100 101 102 106 107 108 109 110

Sometimes we will need to search for certain strings in a vector. With multiple conditions, it becomes cumbersome to use the “OR” operator |. The operator %in% tests, for each element of the vector on its left, whether it is found in the search vector on its right:

animals <- c("mouse", "rat", "dog", "cat")
animals[animals == "cat" | animals == "rat"] # returns both rat and cat
## [1] "rat" "cat"
animals %in% c("rat", "cat", "dog", "duck", "goat")
## [1] FALSE  TRUE  TRUE  TRUE
animals[animals %in% c("rat", "cat", "dog", "duck", "goat")]
## [1] "rat" "dog" "cat"
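
To make the saving explicit, this sketch builds the same logical vector twice: once by chaining | by hand and once with %in%. The names wanted, by_or and by_in are illustrative.

```r
animals <- c("mouse", "rat", "dog", "cat")
wanted  <- c("rat", "cat", "dog", "duck", "goat")

# Long form: one comparison per search term, chained with |
by_or <- animals == "rat" | animals == "cat" | animals == "dog" |
  animals == "duck" | animals == "goat"

# Short form: one call to %in%
by_in <- animals %in% wanted

identical(by_or, by_in)            # TRUE
animals[!(animals %in% wanted)]    # "mouse": the elements NOT in the search vector
```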

Names of a vector

Let’s say that we want to know which color robe each of 3 patients is wearing. We can assign names to the vector of colors.

v_colors
## [1] "blue"   "red"    "purple"
names(v_colors) <- c("Thomas", "Liz", "Tucker")
v_colors
##   Thomas      Liz   Tucker 
##   "blue"    "red" "purple"

Algebraic Operations of Vectors

x <- c(1,2,3)
y <- c(4,5,6)
# component-wise addition
x+y
## [1] 5 7 9
# component-wise multiplication
x*y
## [1]  4 10 18
# What happens to the following
y^x # or y**x
## [1]   4  25 216

Repeating Vector in R

# Would this work?
c(1,2,3,4) + c(1,2)
## [1] 2 4 4 6
# Would this work?
c(1,2,3) + c(1,2)
## [1] 2 4 4

Why the weird results?

  • When you are adding vectors of unequal size, R automatically repeats (recycles) the shorter one to fill in the operation. If the length of the longer one is not a multiple of the length of the shorter one, R still recycles but issues a warning.
2*c(1,2,3)
## [1] 2 4 6
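
A sketch that spells out the recycling by hand with rep(). The variable names are illustrative, and suppressWarnings() is used only so that the example runs quietly.

```r
short <- c(1, 2)
long  <- c(1, 2, 3, 4)

long + short                       # 2 4 4 6
long + rep(short, times = 2)       # the same thing, written out explicitly
identical(long + short, long + rep(short, times = 2))   # TRUE

# Lengths 3 and 2: R still recycles (1, 2, 1) but normally warns about it
uneven <- suppressWarnings(c(1, 2, 3) + c(1, 2))
uneven                             # 2 4 4
```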

Matrix

Create a new matrix

(matrix<-matrix(1:16, nrow = 4, byrow = TRUE))
##      [,1] [,2] [,3] [,4]
## [1,]    1    2    3    4
## [2,]    5    6    7    8
## [3,]    9   10   11   12
## [4,]   13   14   15   16

Note that 1:16 means every integer from 1 to 16. In the matrix() function:

  1. The first argument is the collection of elements that R will arrange into the rows and columns of the matrix. Here, we use 1:16 which is a shortcut for c(1, 2, 3, 4, … 16).
  2. The argument byrow indicates that the matrix is filled by the rows. If we want the matrix to be filled by the columns, we use byrow = FALSE.
  3. The argument nrow indicates that the matrix should have 4 rows.

Selection of Matrix Elements

Selection of matrix elements is similar to that for vectors, except that we have two dimensions over which to subset: rows and columns.

# matrix[r,c] #Standard form of the matrix.

matrix[1,2] #Extract element in the first row and second column
## [1] 2
#Extract the entire first and second columns
matrix[,1:2] 
##      [,1] [,2]
## [1,]    1    2
## [2,]    5    6
## [3,]    9   10
## [4,]   13   14
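
Leaving the row or the column position empty selects everything along that dimension. A short sketch, using a separate object named m (an illustrative name) so that the matrix above is not disturbed:

```r
m <- matrix(1:16, nrow = 4, byrow = TRUE)

m[1, ]            # the entire first row: 1 2 3 4
m[, 2]            # the entire second column: 2 6 10 14
m[2:3, c(1, 4)]   # rows 2 to 3, columns 1 and 4
```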

Assign dimension names to Matrix

    rownames(matrix) <- c("Yes", "No", "Perhaps", "Maybe")
    colnames(matrix) <- c("Apple", "Pear", "Banana", "Grapes")
    matrix
##         Apple Pear Banana Grapes
## Yes         1    2      3      4
## No          5    6      7      8
## Perhaps     9   10     11     12
## Maybe      13   14     15     16

Dimension of a matrix vs vector

x <- c(1,2,3)
matrix<-matrix(1:4, byrow = TRUE, nrow = 2)
length(x)
## [1] 3
length(matrix)
## [1] 4
dim(matrix)
## [1] 2 2
dim(x)
## NULL

Lists

R doesn’t like vectors to have different types: c(TRUE, 1, "Frank") becomes c("TRUE", "1", "Frank"). But storing objects with different types is absolutely fundamental to data analysis. R has a different type of object besides a vector used to store data of different types side-by-side: a list:

c(TRUE, 1, "Frank")
## [1] "TRUE"  "1"     "Frank"
x <- list(TRUE, 1, "Frank")

Many different things, not necessarily of the same length, can be put together.

x <- list(c(1:5), c("a", "b","c"), c(TRUE, FALSE), c(5L, 6L))
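
Once a list exists, its elements are usually pulled out with double square brackets, or with $ when the elements are named. A minimal sketch with made-up names (person, named, scores):

```r
person <- list(TRUE, 1, "Frank")

class(person[[1]])   # "logical": each element keeps its own type
class(person[[3]])   # "character"

person[3]            # single brackets return a list of length 1
person[[3]]          # double brackets return the element itself: "Frank"

# Elements can also be given names and accessed with $
named <- list(name = "Frank", scores = c(90, 85))
named$scores         # 90 85
mean(named$scores)   # 87.5
```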

Dataframes

  • Data frames are like spreadsheet data, rectangular with rows and columns.
  • Ideally each row represents data on a single observation and each column contains data on a single variable, or characteristic, of the observation.
  • It represents the data in a tabular format where the columns are vectors that all have the same length. Because columns are vectors, each column must contain a single type of data (e.g., characters, integers, factors).
  • We will be working with spreadsheets a lot.
  • We can open a data viewer window to see the contents of R’s iris data frame by typing
View(iris)

Create a data frame

Data frame with Harry Potter characters

    name <- c("Harry", "Ron", "Hermione", "Hagrid", "Voldemort")    
    height <- c(176, 175, 167, 230, 180)    
    gpa <- c(3.4, 2.8, 4.0, 2.2, 3.4)   
    df_students <- data.frame(name, height, gpa)        
    df_students 
##        name height gpa
## 1     Harry    176 3.4
## 2       Ron    175 2.8
## 3  Hermione    167 4.0
## 4    Hagrid    230 2.2
## 5 Voldemort    180 3.4

Alternative way of creating DF

    df_students <- data.frame(name = c("Harry", "Ron", "Hermione", "Hagrid",
                       "Voldemort"),    
                  height = c(176, 175, 167, 230, 180),  
                  gpa = c(3.4, 2.8, 4.0, 2.2, 3.4)) 
    df_students
##        name height gpa
## 1     Harry    176 3.4
## 2       Ron    175 2.8
## 3  Hermione    167 4.0
## 4    Hagrid    230 2.2
## 5 Voldemort    180 3.4

Adding variable

    df_students$good <- c(1, 1, 1, 1, 0)    
    df_students 
##        name height gpa good
## 1     Harry    176 3.4    1
## 2       Ron    175 2.8    1
## 3  Hermione    167 4.0    1
## 4    Hagrid    230 2.2    1
## 5 Voldemort    180 3.4    0

Features of the DF

    dim(df_students)        
    df_students[2, 3]               #Ron's GPA
    df_students$gpa[2]              #Ron's GPA
    df_students[5, ]                #get row 5
    df_students[3:5, ]              #get rows 3-5
    df_students[, 2]                #get column 2 (height)
    df_students$height              #get column 2 (height)
    df_students[, 1:3]              #get columns 1-3
    df_students[4, 2] <- 255        #reassign Hagrid's height
    df_students$height[4] <- 255    #same thing as above
    df_students     
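
The conditional subsetting we used for vectors also works on the rows of a data frame. A sketch, rebuilt under the separate name students (an illustrative name) so that df_students above is left unchanged:

```r
students <- data.frame(
  name   = c("Harry", "Ron", "Hermione", "Hagrid", "Voldemort"),
  height = c(176, 175, 167, 230, 180),
  gpa    = c(3.4, 2.8, 4.0, 2.2, 3.4)
)

# Rows where the condition is TRUE, all columns
high_gpa <- students[students$gpa > 3, ]
high_gpa             # Harry, Hermione and Voldemort

# One column, filtered by a condition on another column
tall_names <- students$name[students$height >= 180]
tall_names           # Hagrid and Voldemort
```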

Exercise

Now that you are equipped with the basics, go ahead and take the following DataCamp course, R Intro on DataCamp. Your invitations should now be in your inbox.

Part IV: Working directories

You can use the

getwd()

command to obtain the current directory R is using.

It is good practice to set the working directory location to where the files and data are stored.

  • Consider setting your working directory to a folder called AAEC4984, AAEC5484, or STAT5484 on your desktop (for example).

Creating Directory and Set working directory

Windows

  setwd("C:/users/[your user name]/Desktop/AAEC4984/")
  # OR
  setwd("C:\\users\\[your user name]\\Desktop\\AAEC4984\\") 
  # notice the double backslashes

Mac

  setwd("~/Desktop/AAEC4984")
  • To check whether the wd is correct, we again use
getwd()
  • To obtain a list of the names of files or folders in the working directory, we can use
  dir()
  • To create a new folder in your directory we can use
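
One way to create a folder from base R is dir.create(). The sketch below is only an illustration: the folder name is hypothetical, and it is created inside R's temporary directory (tempdir()) so that the example is safe to run anywhere.

```r
# Hypothetical folder name, placed in a temporary location for safety
new_folder <- file.path(tempdir(), "AAEC4984_data")

dir.create(new_folder, showWarnings = FALSE)  # create the folder
dir.exists(new_folder)                        # TRUE: the folder now exists
```

Calling dir.create() with just a folder name, such as dir.create("data"), would create that folder inside the current working directory instead.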

Importing data

R allows us to import several file types. I will discuss 3 that we are most likely to use in this course.

  1. Text files
    Data sometimes come with headers (the first row is variable names, not actual data!). You need to tell R that!
  textdata<-read.table("examples/hogsdata.txt",header=TRUE)
  2. CSV files :

  3. xlsx files (requires openxlsx package)

  xlsxdata<-openxlsx::read.xlsx("examples/hogsdata.xlsx", ... )
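
For CSV files, base R provides read.csv(). The sketch below is self-contained rather than tied to the course files: it first writes a tiny, made-up data frame to a temporary CSV file and then reads it back. The object names and the values are purely illustrative.

```r
# A toy data set with invented values, just to have something to import
toy <- data.frame(id = 1:3, value = c(1.5, 2.5, 3.5))

# Write it to a temporary CSV file
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)

# Import it; header = TRUE says the first row holds the variable names
csvdata <- read.csv(csv_path, header = TRUE)
csvdata
```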

Functions & Packages

Functions are “canned scripts” that automate more complicated sets of commands, including operations, assignments, etc. For the purpose of this course, we will use many functions that are either built into base R (that is, they are predefined) or available through R packages (discussed below).

A function usually takes one or more inputs called arguments, and often (but not always) returns a value.


Consider, for example, taking the average of a set of random numbers (x).

set.seed(124) 
x <- rnorm(6) * 100
(round(x, digits=2)) # round function => 2dp
## [1] -138.51    3.83  -76.30   21.23  142.55   74.45

If we were to do this manually, we would:

  1. Sum up the values
sumx <- sum(x)
  2. Get the number of observations
nx <- length(x)
  3. Divide the sum by the total number of observations
meanx <- sumx/nx

Using R’s built in mean function we can do all three steps internally and cross check against our manual calculations.

mean(x)
## [1] 4.542439
meanx == mean(x) # cross-check against the manual calculation
## [1] TRUE
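
We can also package the manual steps into a function of our own. This is a minimal sketch: my_mean and the example vector are hypothetical names, and all.equal() is used for the comparison because it tolerates tiny floating-point differences that == does not.

```r
# A user-defined function: one argument in, one value returned
my_mean <- function(values) {
  sum(values) / length(values)
}

example_values <- c(2, 4, 6, 8)
my_mean(example_values)                                     # 5
all.equal(my_mean(example_values), mean(example_values))    # TRUE
```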

Installing Packages

Since R is an Open Source software program, thousands of people contribute to the software. They do this by writing commands (called functions) to make a particular analysis easier, or to make a graphic prettier.

When you download R, you get access to a lot of functions that we will use. However, the user-written packages we use for our analyses will make our lives much easier.

For example, though we can use the plot command for standard graphics, you will quickly see that we can get much better looking time graphs using the fpp2 package (which also uses another package called ggplot2).

To install the fpp2 package, we can use the command

install.packages("fpp2")

We will need to install a package only once in R.

Now that you have the fpp2 package installed, we can check to see if it is in use

search()

Lastly, in order to use the package, we will need to load the library

library(fpp2)
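
The difference between a package being installed and being loaded can be checked from code. In this sketch, is_installed() is a hypothetical helper written for the illustration; requireNamespace() and search() are base R. The stats package is used for the checks because it ships with R and is loaded by default.

```r
# Hypothetical helper: TRUE if the package is installed on this computer
is_installed <- function(pkg) {
  isTRUE(requireNamespace(pkg, quietly = TRUE))
}

is_installed("stats")            # TRUE: stats comes with R
"package:stats" %in% search()    # TRUE: it is also loaded (in use)

# For a contributed package, remind ourselves to install it only if needed
if (!is_installed("fpp2")) {
  message("fpp2 is not installed yet; run install.packages(\"fpp2\") once.")
}
```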

Using libraries

The fpp2 package contains a number of useful datasets. One such data set is h02.

Use the help() function to get a description of this data set. Try

help(h02)

Now let us create a nice plot of the h02 data

autoplot(h02)

Let us leave it there for now!